-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add metric is_eligible_for_master_election to support reelection decision in codis-sentinel #2766
fix: add metric is_eligible_for_master_election to support reelection decision in codis-sentinel #2766
Conversation
…instance has corrupted full sync, which can be used in codis-pika cluster reelection scenario
use Is_eligible_for_master_election in sentinel
WalkthroughThe recent changes primarily revolve around enhancing the master election process in the Codis and Pika systems. A new field Changes
Assessment against linked issues
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (10)
- codis/pkg/models/group.go (2 hunks)
- codis/pkg/topom/topom_group.go (1 hunks)
- codis/pkg/topom/topom_sentinel.go (2 hunks)
- codis/pkg/utils/redis/sentinel.go (2 hunks)
- conf/pika.conf (1 hunks)
- include/pika_conf.h (4 hunks)
- src/pika_admin.cc (4 hunks)
- src/pika_conf.cc (4 hunks)
- src/pika_db.cc (1 hunks)
- src/pika_repl_client_conn.cc (1 hunks)
Files skipped from review due to trivial changes (2)
- conf/pika.conf
- src/pika_db.cc
Additional comments not posted (13)
codis/pkg/models/group.go (2)
34-34
: Ensure logical correctness inSelectNewMaster
method.The modification in the
SelectNewMaster
method now considersIsEligibleForMasterElection
to determine the eligibility of a server for election as master. This is a critical change for the election logic.
89-89
: AddedIsEligibleForMasterElection
field toGroupServer
struct.This new boolean field is crucial for managing which servers are eligible for master election. Ensure that all server instances are correctly initialized with this field to maintain state consistency across the application.
codis/pkg/utils/redis/sentinel.go (2)
75-75
: AddedIsEligibleForMasterElection
field toInfoReplication
struct.This addition is crucial for the replication logic to consider the eligibility of a server for master election. Ensure that this field is correctly handled in all relevant data flows.
112-112
: UpdatedUnmarshalJSON
method to handleIsEligibleForMasterElection
.The method now correctly parses the
IsEligibleForMasterElection
field from JSON. Ensure that the boolean parsing logic ("true"
string check) aligns with the expected JSON structure.codis/pkg/topom/topom_sentinel.go (2)
51-51
: UpdatedtryFixReplicationRelationships
method to includemasterOfflineGroups
length.The inclusion of the
masterOfflineGroups
length as a parameter is crucial for handling scenarios where the master is offline. This change should improve the robustness of master-slave switch logic.
95-95
: UpdatedcheckAndUpdateGroupServerState
method to handleIsEligibleForMasterElection
.The method now correctly assigns the
IsEligibleForMasterElection
value based on the replication state. This is a critical update for ensuring that only eligible servers can become master.src/pika_repl_client_conn.cc (1)
189-189
: Added tracking for unfinished full syncs inHandleDBSyncResponse
.Calling
AddInternalUsedUnfinishedFullSync
during DB sync response handling is a critical update for tracking the state of sync operations. Ensure that this method is called under correct conditions and handles exceptions appropriately.codis/pkg/topom/topom_group.go (1)
406-406
: UpdatedtryFixReplicationRelationship
method to handleIsEligibleForMasterElection
.The method now correctly updates the
IsEligibleForMasterElection
field based on the replication state. This change is crucial for ensuring that the master election logic respects the eligibility criteria.src/pika_conf.cc (1)
786-786
: Configuration rewriting should handle new settings appropriately.The
ConfigRewrite
function now includes theinternal-used-unfinished-full-sync
setting. It's crucial to ensure that this setting is consistently managed and rewritten accurately to avoid configuration drift or inconsistencies.Ensure that the rewrite logic is tested extensively to confirm that it handles edge cases and maintains consistency across system restarts.
include/pika_conf.h (3)
846-852
: Proper initialization and management of the internal set.The method
SetInternalUsedUnFinishedFullSync
initializes the internal set from a comma-separated string. This method correctly handles the conversion and storage, ensuring that the internal state is updated appropriately.The implementation uses
std::lock_guard
for thread safety andpstd::StringToLower
for consistent string handling. Ensure that the input string is validated or sanitized before processing to prevent issues related to unexpected input formats.
865-874
: Handling removal of sync configurations.The
RemoveInternalUsedUnfinishedFullSync
method allows for the removal of entries from the internal set. It properly locks the operation and updates the configuration file, ensuring consistency.This method is crucial for maintaining accurate state in dynamic environments. Consider adding logging or monitoring around these operations to aid in debugging and operational transparency.
876-879
: Utility method to get the count of unfinished syncs.The
GetUnfinishedFullSyncCount
method provides a simple interface to retrieve the count of unfinished syncs, which can be useful for monitoring and decision-making processes.This method is straightforward and uses
std::shared_lock
for thread-safe read access. Ensure that this method is used appropriately in monitoring or decision-making logic to provide real-time insights into the system's state.src/pika_admin.cc (1)
Line range hint
1061-1161
: Updated metrics for master election eligibility inInfoCmd::InfoReplication
.The changes correctly implement the new metric
is_eligible_for_master_election
as described in the PR objectives. The logic checks if the current instance is syncing full or has full sync corrupted, which aligns with the intended behavior to determine eligibility for master election.However, ensure that the
syncing_full_count
andGetUnfinishedFullSyncCount
methods are robust and handle edge cases appropriately. It's crucial that these methods accurately reflect the state of the system to prevent incorrect master election decisions.Verification successful
Verification Successful: Updated metrics for master election eligibility in
InfoCmd::InfoReplication
The implementation of
syncing_full_count
andGetUnfinishedFullSyncCount
has been verified. The logic correctly checks if the current instance is syncing full or has full sync corrupted, ensuring accurate determination of eligibility for master election.
syncing_full_count
is incremented correctly within the loop.GetUnfinishedFullSyncCount
returns the size of theinternal_used_unfinished_full_sync_
set, which is thread-safe.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify the implementation of syncing_full_count and GetUnfinishedFullSyncCount. # Expect: Definitions and usage in the code that ensure accuracy and robustness. rg --type cpp 'syncing_full_count|GetUnfinishedFullSyncCount'Length of output: 324
Script:
#!/bin/bash # Extract the implementation details of GetUnfinishedFullSyncCount from pika_conf.h ast-grep --lang cpp --pattern 'size_t GetUnfinishedFullSyncCount() { $$$ }'Length of output: 77
Script:
#!/bin/bash # Search for the definition and implementation of GetUnfinishedFullSyncCount across the entire codebase rg --type cpp 'GetUnfinishedFullSyncCount'Length of output: 203
Script:
#!/bin/bash # Extract the full content of include/pika_conf.h to locate the implementation details of GetUnfinishedFullSyncCount cat include/pika_conf.hLength of output: 34058
@@ -828,6 +838,7 @@ int PikaConf::ConfigRewrite() { | |||
int PikaConf::ConfigRewriteReplicationID() { | |||
std::lock_guard l(rwlock_); | |||
SetConfStr("replication-id", replication_id_); | |||
SetConfStr("internal-used-unfinished-full-sync", pstd::Set2String(internal_used_unfinished_full_sync_, ',')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplication in handling configuration rewriting.
The method ConfigRewriteReplicationID
also handles the internal-used-unfinished-full-sync
, which is already handled in ConfigRewrite
. This duplication could lead to maintenance challenges and bugs if the synchronization logic changes in one place but not the other.
- SetConfStr("internal-used-unfinished-full-sync", pstd::Set2String(internal_used_unfinished_full_sync_, ','));
Consider centralizing the handling of this configuration to ensure consistency and reduce code duplication.
Committable suggestion was skipped due to low confidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal-used-unfinished-full-sync is related with replicationID, that's why I put them together there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cheniujh, thank you for the clarification. I understand now that internal-used-unfinished-full-sync
is logically related to replicationID
, which justifies handling them together in ConfigRewriteReplicationID
. To ensure consistency and avoid potential bugs, it would be helpful to document this relationship clearly. This way, any future changes to this logic can be applied consistently in both methods.
Learnings added
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: src/pika_conf.cc:841-841
Timestamp: 2024-07-03T02:23:46.544Z
Learning: The `internal-used-unfinished-full-sync` configuration is related to `replicationID`, which is why it is handled together in the `ConfigRewriteReplicationID` method.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
std::string unfinished_full_sync; | ||
GetConfStr("internal-used-unfinished-full-sync", &unfinished_full_sync); | ||
if (replication_id_.empty()) { | ||
unfinished_full_sync.clear(); | ||
} | ||
SetInternalUsedUnFinishedFullSync(unfinished_full_sync); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure robust handling of configuration parameters.
The implementation clears the unfinished_full_sync
string if replication_id_
is empty. This conditional clearing might be intended to reset the configuration under certain conditions, but it's unclear why this is tied specifically to replication_id_
. This could potentially lead to unexpected behavior if replication_id_
is not directly related to the unfinished full syncs.
- if (replication_id_.empty()) {
- unfinished_full_sync.clear();
- }
Consider documenting the rationale behind this conditional or ensuring that it aligns with the intended logic.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::string unfinished_full_sync; | |
GetConfStr("internal-used-unfinished-full-sync", &unfinished_full_sync); | |
if (replication_id_.empty()) { | |
unfinished_full_sync.clear(); | |
} | |
SetInternalUsedUnFinishedFullSync(unfinished_full_sync); | |
std::string unfinished_full_sync; | |
GetConfStr("internal-used-unfinished-full-sync", &unfinished_full_sync); | |
SetInternalUsedUnFinishedFullSync(unfinished_full_sync); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 unfinished_full_sync is related with replication_id_
2 I will add more docs/comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cheniujh, glad you agree! I'll make sure to remember this for future reviews.
Learnings added
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: src/pika_conf.cc:188-193
Timestamp: 2024-07-03T02:24:14.326Z
Learning: The `internal-used-unfinished-full-sync` configuration is handled without conditional clearing based on `replication_id_`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Learnings used
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: src/pika_conf.cc:841-841
Timestamp: 2024-07-03T02:23:46.695Z
Learning: The `internal-used-unfinished-full-sync` configuration is related to `replicationID`, which is why it is handled together in the `ConfigRewriteReplicationID` method.
void AddInternalUsedUnfinishedFullSync(const std::string& db_name) { | ||
{ | ||
std::lock_guard l(rwlock_); | ||
internal_used_unfinished_full_sync_.insert(db_name); | ||
std::string lower_value = pstd::Set2String(internal_used_unfinished_full_sync_, ','); | ||
pstd::StringToLower(lower_value); | ||
TryPushDiffCommands("internal-used-unfinished-full-sync", lower_value); | ||
} | ||
ConfigRewrite(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamic management of sync configurations.
The AddInternalUsedUnfinishedFullSync
method dynamically updates the internal set and rewrites the configuration. This approach is flexible but could lead to performance issues if called frequently due to the repeated calls to ConfigRewrite
.
- ConfigRewrite();
+ // Consider deferring the rewrite operation or batching updates to improve performance.
Evaluate the impact of these operations on system performance, especially under high load, and consider optimizing by batching or deferring updates.
Committable suggestion was skipped due to low confidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConfigRewrite() is not that expensive although it write files, and more important, AddInternalUsedUnfinishedFullSync(const std::string& db_name) will not be called frequently and which is the main reason of no need of worry about performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cheniujh, thank you for the clarification. It's good to know that ConfigRewrite()
is not expensive and AddInternalUsedUnfinishedFullSync
will not be called frequently. This alleviates concerns about potential performance issues.
Learnings added
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: include/pika_conf.h:854-863
Timestamp: 2024-07-03T02:45:10.624Z
Learning: The `ConfigRewrite()` function is not expensive, even though it writes files, and the `AddInternalUsedUnfinishedFullSync(const std::string& db_name)` method will not be called frequently, thus minimizing performance concerns.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Learnings used
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: src/pika_conf.cc:841-841
Timestamp: 2024-07-03T02:23:46.695Z
Learning: The `internal-used-unfinished-full-sync` configuration is related to `replicationID`, which is why it is handled together in the `ConfigRewriteReplicationID` method.
Learnt from: cheniujh
PR: OpenAtomFoundation/pika#2766
File: src/pika_conf.cc:188-193
Timestamp: 2024-07-03T02:24:14.456Z
Learning: The `internal-used-unfinished-full-sync` configuration is handled without conditional clearing based on `replication_id_`.
fix: IsEligibleForMasterElection slaveof force
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- codis/pkg/topom/topom_group.go (3 hunks)
Files skipped from review as they are similar to previous changes (1)
- codis/pkg/topom/topom_group.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/pika_conf.cc (4 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/pika_conf.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- codis/pkg/topom/topom_group.go (3 hunks)
Additional comments not posted (2)
codis/pkg/topom/topom_group.go (2)
406-406
: LGTM!The new field
IsEligibleForMasterElection
is correctly being set based on the replication state.
535-540
: LGTM!The conditional check for
IsEligibleForMasterElection
ensures the logic aligns with the PR objectives.
… decision in codis-sentinel (#2766) * add replication * add metric 'is_eligible_for_master_election' to indicate whether the instance has corrupted full sync, which can be used in codis-pika cluster reelection scenario * add IsEligibleForMasterElection * IsEligibleForMasterElection slaveof force --------- Co-authored-by: liuyuecai <[email protected]> Co-authored-by: cjh <[email protected]>
… decision in codis-sentinel (OpenAtomFoundation#2766) * add replication * add metric 'is_eligible_for_master_election' to indicate whether the instance has corrupted full sync, which can be used in codis-pika cluster reelection scenario * add IsEligibleForMasterElection * IsEligibleForMasterElection slaveof force --------- Co-authored-by: liuyuecai <[email protected]> Co-authored-by: cjh <[email protected]>
… decision in codis-sentinel (OpenAtomFoundation#2766) * add replication * add metric 'is_eligible_for_master_election' to indicate whether the instance has corrupted full sync, which can be used in codis-pika cluster reelection scenario * add IsEligibleForMasterElection * IsEligibleForMasterElection slaveof force --------- Co-authored-by: liuyuecai <[email protected]> Co-authored-by: cjh <[email protected]>
这个PR修复了 fix #2436
1 具体地,通过对外提供指标‘is_eligible_for_master_election’(通过info replication/info命令),告知本实例是否有资格在fail over是成为新master的候选者。
当前instance如果正在进行全量同步(作为slave),或者运行历史中的上一次全量同步没有做完(自己挂了或者主挂了),is_eligible_for_master_election都会为false,除此之外为true。
该指标如何判断之前是否有过意外中断的全量同步:利用了“internal-used-unfinished-full-sync”,这是一个set结构,借助了pika.conf来做持久化,每次有任何一个DB开始全量同步时,都会往里面插入自己的DBname并且立刻持久化,当任一个DB完成了全量同步,就会从中移除自己的DBname并且做持久化。所以,如果有一个从节点开始了全量同步,却在全量同步过程中发生意外导致全量同步没有做完,这个unfinished-full-sync中就会留下其DBname,我们也可以借此知悉该DB上一次全量同步是没有做完的(如果该DB目前不在进行全量同步,而unfinished-full-sync这个Set中确有其DBName,那这个DB上一次的全量同步就是意外中断,不完整的)。另外,如果用户清空了实例的replicationID,unfinished-full-sync也会被自动清空,因为这二者其实相互关联。
2 在codis-sentinel中加入了使用该指标的逻辑:
This PR fixes issue #2436:
Specifically, it introduces the metric 'is_eligible_for_master_election' (accessible via the info replication/info command) to indicate whether this instance is eligible to be a candidate for the new master during a failover.
If the current instance is performing a full synchronization (as a slave) or if the last full synchronization in its history was incomplete (due to its own failure or the master's failure), 'is_eligible_for_master_election' will be false. Otherwise, it will be true.
This metric's logic has been integrated into codis-sentinel:
Summary by CodeRabbit
New Features
IsEligibleForMasterElection
to enhance master election criteria, ensuring more reliable and optimized master selection in cluster scenarios.Improvements